Citation Quality
Deep Literature Survey Automation with an Iterative Workflow
Zhang, Hongbo; Cui, Han; Wang, Yidong; Tian, Yijian; Guo, Qi; Wang, Cunxiang; Wu, Jian; Song, Chiyu; Zhang, Yue
Automatic literature survey generation has attracted increasing attention, yet most existing systems follow a one-shot paradigm, where a large set of papers is retrieved at once and a static outline is generated before drafting. This design often leads to noisy retrieval, fragmented structures, and context overload, ultimately limiting survey quality. Inspired by the iterative reading process of human researchers, we propose IterSurvey, a framework based on recurrent outline generation, in which a planning agent incrementally retrieves, reads, and updates the outline to ensure both exploration and coherence. To provide faithful paper-level grounding, we design paper cards that distill each paper into its contributions, methods, and findings, and introduce a review-and-refine loop with visualization enhancement to improve textual flow and integrate multimodal elements such as figures and tables. Experiments on both established and emerging topics show that IterSurvey substantially outperforms state-of-the-art baselines in content coverage, structural coherence, and citation quality, while producing more accessible and better-organized surveys. To provide a more reliable assessment of such improvements, we further introduce Survey-Arena, a pairwise benchmark that complements absolute scoring and more clearly positions machine-generated surveys relative to human-written ones. The code is available at https://github.com/HancCui/IterSurvey_Autosurveyv2.
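The recurrence at the heart of this design is easy to picture in code. Below is a minimal Python sketch of the retrieve-read-update cycle; PaperCard, retrieve, distill_to_card, and plan_next_query are all invented names here, and the LLM and retrieval calls are stubbed with toy logic.

```python
from dataclasses import dataclass

@dataclass
class PaperCard:
    # Distilled paper-level grounding: contributions, methods, findings.
    title: str
    contributions: str
    methods: str
    findings: str

# Toy corpus standing in for a real retrieval backend.
CORPUS = [
    {"title": "RAG Survey", "text": "retrieval augmented generation overview"},
    {"title": "Outline Planning", "text": "hierarchical outline planning for long documents"},
    {"title": "Citation Checking", "text": "verifying citations in generated text"},
]

def retrieve(query: str, k: int = 2) -> list:
    # Naive keyword overlap; a real system would use dense retrieval.
    scored = sorted(CORPUS, key=lambda d: -sum(w in d["text"] for w in query.lower().split()))
    return scored[:k]

def distill_to_card(doc: dict) -> PaperCard:
    # A real system would prompt an LLM; here we just copy fields.
    return PaperCard(doc["title"], doc["text"], doc["text"], doc["text"])

def plan_next_query(outline: dict) -> str:
    # Target the least-grounded section, mimicking incremental reading.
    return min(outline, key=lambda s: len(outline[s]))

outline = {"Background": [], "Methods": [], "Evaluation": []}
for _ in range(3):  # recurrent rounds: retrieve -> read -> update outline
    section = plan_next_query(outline)
    for doc in retrieve(section):
        outline[section].append(distill_to_card(doc))
print({s: [c.title for c in cards] for s, cards in outline.items()})
```

Each round grounds the least-developed section first, which is one plausible reading of how a planning agent "incrementally retrieves, reads, and updates the outline".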
SurveyGen: Quality-Aware Scientific Survey Generation with Large Language Models
Bao, Tong; Nayeem, Mir Tafseer; Rafiei, Davood; Zhang, Chengzhi
Automatic survey generation has emerged as a key task in scientific document processing. While large language models (LLMs) have shown promise in generating survey texts, the lack of standardized evaluation datasets critically hampers rigorous assessment of their performance against human-written surveys. In this work, we present SurveyGen, a large-scale dataset comprising over 4,200 human-written surveys across diverse scientific domains, along with 242,143 cited references and extensive quality-related metadata for both the surveys and the cited papers. Leveraging this resource, we build QUAL-SG, a novel quality-aware framework for survey generation that enhances the standard Retrieval-Augmented Generation (RAG) pipeline by incorporating quality-aware indicators into literature retrieval to assess and select higher-quality source papers. Using this dataset and framework, we systematically evaluate state-of-the-art LLMs under varying levels of human involvement, from fully automatic generation to human-guided writing. Experimental results and human evaluations show that while semi-automatic pipelines can achieve partially competitive outcomes, fully automatic survey generation still suffers from low citation quality and limited critical analysis.
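As an illustration of what folding "quality-aware indicators" into literature retrieval could look like, here is a hedged sketch that blends a relevance score with invented quality signals (log-scaled citation count and a venue prior); the indicators and weights QUAL-SG actually uses may differ.

```python
import math

def quality(paper: dict) -> float:
    # Log-scaled citation count plus an optional venue prior in [0, 1].
    return 0.7 * math.log1p(paper["citations"]) / 10 + 0.3 * paper.get("venue_prior", 0.0)

def select_sources(candidates: list, k: int = 2, alpha: float = 0.6) -> list:
    # alpha trades off retrieval relevance against the quality indicator.
    ranked = sorted(candidates,
                    key=lambda p: alpha * p["relevance"] + (1 - alpha) * quality(p),
                    reverse=True)
    return ranked[:k]

candidates = [
    {"title": "A", "relevance": 0.9, "citations": 3,    "venue_prior": 0.2},
    {"title": "B", "relevance": 0.8, "citations": 900,  "venue_prior": 0.9},
    {"title": "C", "relevance": 0.5, "citations": 4000, "venue_prior": 0.9},
]
print([p["title"] for p in select_sources(candidates)])  # ['B', 'C']: quality outranks raw relevance here
```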
Lessons from Training Grounded LLMs with Verifiable Rewards
Sim, Shang Hong; Pala, Tej Deep; Toh, Vernon; Chieu, Hai Leong; Zadeh, Amir; Li, Chuan; Majumder, Navonil; Poria, Soujanya
Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs). While retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly stated answers, citing incorrectly, or refusing when evidence is available. In this work, we explore how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs. We use the GRPO (Group Relative Policy Optimization) method to train models with verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations. Through comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA, we show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses. A two-stage training setup, first optimizing answer and citation behavior and then refusal, further improves grounding by stabilizing the learning signal. Additionally, we revisit instruction tuning via GPT-4 distillation and find that combining it with GRPO enhances performance on long-form, generative QA tasks. Overall, our findings highlight the value of reasoning, stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs.
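The verifiable, outcome-based rewards lend themselves to a compact sketch. In the toy below, the weights, the refusal heuristic, and the citation-sufficiency definition are assumptions rather than the paper's exact formulation; the group-normalized advantage is the core GRPO ingredient.

```python
from statistics import mean, pstdev

def reward(resp: dict, gold: list, support: set, answerable: bool) -> float:
    refused = resp["text"].lower().startswith("i cannot answer")
    if not answerable:
        return 1.0 if refused else 0.0           # refusal quality on unanswerables
    if refused:
        return 0.0
    correct = any(g.lower() in resp["text"].lower() for g in gold)
    # Citation sufficiency: fraction of supporting documents actually cited.
    sufficiency = len(resp["cites"] & support) / max(len(support), 1)
    return 0.6 * float(correct) + 0.4 * sufficiency

def group_advantages(rewards: list) -> list:
    # GRPO: each sampled response is scored relative to its own group.
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

group = [
    {"text": "The capital is Paris [1].", "cites": {"d1"}},
    {"text": "I cannot answer from the given sources.", "cites": set()},
]
rs = [reward(r, ["Paris"], {"d1"}, answerable=True) for r in group]
print(rs, group_advantages(rs))  # [1.0, 0.0] [1.0, -1.0]
```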
MedCite: Can Language Models Generate Verifiable Text for Medicine?
Wang, Xiao; Tan, Mengjue; Jin, Qiao; Xiong, Guangzhi; Hu, Yu; Zhang, Aidong; Lu, Zhiyong; Zhang, Minjia
Existing LLM-based medical question-answering systems lack citation generation and evaluation capabilities, raising concerns about their adoption in practice. In this work, we introduce MedCite, the first end-to-end framework that facilitates the design and evaluation of LLM citation generation for medical tasks. We also introduce a novel multi-pass retrieval-citation method that generates high-quality citations. Our evaluation highlights the challenges and opportunities of citation generation for medical tasks and identifies important design choices that significantly affect final citation quality. Our proposed method achieves superior citation precision and recall compared to strong baseline methods, and we show that our evaluation results correlate well with annotations from professional experts.
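A multi-pass retrieval-citation flow of the kind described might be structured like the following toy sketch; draft_answer, retrieve_evidence, and the evidence store are hypothetical stand-ins for the real model and retriever.

```python
from typing import Optional

# Toy evidence store standing in for a medical retrieval index.
EVIDENCE = {
    "metformin lowers blood glucose": "PMID:1111",
    "statins reduce ldl cholesterol": "PMID:2222",
}

def draft_answer(question: str) -> list:
    # Pass 1: stand-in for an LLM draft, already split into atomic claims.
    return ["metformin lowers blood glucose", "statins reduce ldl cholesterol"]

def retrieve_evidence(claim: str) -> Optional[str]:
    # Pass 2: per-claim retrieval against the evidence store.
    return EVIDENCE.get(claim)

def answer_with_citations(question: str) -> str:
    cited = []
    for claim in draft_answer(question):
        doc = retrieve_evidence(claim)
        cited.append(f"{claim} [{doc}]" if doc else f"{claim} [citation needed]")
    return " ".join(cited)  # a further pass could verify each attached citation

print(answer_with_citations("What do metformin and statins do?"))
```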
CiteEval: Principle-Driven Citation Evaluation for Source Attribution
Xu, Yumo; Qi, Peng; Chen, Jifan; Liu, Kunlun; Han, Rujun; Liu, Lan; Min, Bonan; Castelli, Vittorio; Gupta, Arshit; Wang, Zhiguo
Citation quality is crucial in information-seeking systems, directly influencing trust and the effectiveness of information access. Current evaluation frameworks, both human and automatic, rely mainly on Natural Language Inference (NLI) to assess binary or ternary supportiveness from cited sources, which we argue is a suboptimal proxy for citation evaluation. In this work, we introduce CiteEval, a principle-driven citation evaluation framework focused on fine-grained citation assessment within a broad context, encompassing not only the cited sources but also the full retrieval context, user query, and generated text. Guided by the proposed framework, we construct CiteBench, a multi-domain benchmark with high-quality human annotations of citation quality. To enable efficient evaluation, we further develop CiteEval-Auto, a suite of model-based metrics that correlate strongly with human judgments. Experiments across diverse systems demonstrate CiteEval-Auto's superior ability to capture the multifaceted nature of citations compared to existing metrics, offering a principled and scalable approach to evaluating and improving model-generated citations.
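To make the contrast with binary NLI concrete, here is a hedged harness sketch in which a pluggable judge scores each citation while seeing the full retrieval context and user query; the lexical judge is a trivial stub, not CiteEval-Auto itself.

```python
from typing import Callable

def lexical_judge(statement: str, cited: str, context: list, query: str) -> float:
    # Stub: fraction of statement words found in the cited source.
    words = statement.lower().split()
    return sum(w in cited.lower() for w in words) / max(len(words), 1)

def evaluate(statements: list, context: list, query: str,
             judge: Callable = lexical_judge) -> float:
    # Each (statement, cited_source) pair is judged with access to the
    # broader context, not just the cited source in isolation.
    scores = [judge(s, c, context, query) for s, c in statements]
    return sum(scores) / max(len(scores), 1)

context = ["llms often hallucinate citations", "retrieval improves grounding"]
statements = [("LLMs often hallucinate citations", "llms often hallucinate citations")]
print(evaluate(statements, context, "do llms hallucinate?"))  # 1.0
```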
PA-RAG: RAG Alignment via Multi-Perspective Preference Optimization
Wu, Jiayi; Cai, Hengyi; Yan, Lingyong; Sun, Hao; Li, Xiang; Wang, Shuaiqiang; Yin, Dawei; Gao, Ming
The emergence of retrieval-augmented generation (RAG) has alleviated the outdated and hallucinatory content produced by large language models (LLMs), yet many limitations remain. When a general-purpose LLM serves as the RAG generator, it often suffers from inadequate response informativeness, response robustness, and citation quality. Past approaches to these limitations, whether incorporating additional steps beyond response generation or optimizing the generator through supervised fine-tuning (SFT), have failed to align the generator fully with RAG requirements. Consequently, optimizing the RAG generator from multiple preference perspectives while maintaining its end-to-end LLM form remains a challenge. To bridge this gap, we propose Multiple Perspective Preference Alignment for Retrieval-Augmented Generation (PA-RAG), a method for optimizing the generator of RAG systems to align comprehensively with RAG requirements. Specifically, we construct high-quality instruction fine-tuning data and multi-perspective preference data by sampling responses of varied quality from the generator under different document-quality scenarios. We then optimize the generator using SFT and Direct Preference Optimization (DPO). Extensive experiments on four question-answering datasets across three LLMs demonstrate that PA-RAG significantly enhances the performance of RAG generators. Our code and datasets are available at https://github.com/wujwyi/PA-RAG.
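The preference-data construction step can be sketched as follows. The scoring heuristics are invented proxies for the informativeness and citation-quality perspectives, and the resulting (chosen, rejected) pairs would feed a standard DPO trainer.

```python
def score(resp: dict) -> float:
    # Crude proxies: length as informativeness, supported-citation ratio
    # as citation quality. The real perspectives are richer than this.
    informativeness = min(len(resp["text"].split()) / 50, 1.0)
    citation_quality = len(resp["supported_cites"]) / max(len(resp["cites"]), 1)
    return 0.5 * informativeness + 0.5 * citation_quality

def build_preference_pair(samples: list) -> tuple:
    ranked = sorted(samples, key=score, reverse=True)
    return ranked[0], ranked[-1]   # (chosen, rejected) pair for DPO

samples = [
    {"text": "Short answer.", "cites": ["d1"], "supported_cites": []},
    {"text": "A well-grounded answer citing both retrieved documents in detail.",
     "cites": ["d1", "d2"], "supported_cites": ["d1", "d2"]},
]
chosen, rejected = build_preference_pair(samples)
print(chosen["text"][:30], "| vs |", rejected["text"][:30])
```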
Advancing Large Language Model Attribution through Self-Improving
Huang, Lei; Feng, Xiaocheng; Ma, Weitao; Zhao, Liang; Fan, Yuchun; Zhong, Weihong; Xu, Dongliang; Yang, Qing; Liu, Hongtao; Qin, Bing
Teaching large language models (LLMs) to generate text with citations to evidence sources can mitigate hallucinations and enhance verifiability in information-seeking systems. However, improving this capability requires high-quality attribution data, which is costly and labor-intensive. Inspired by recent advances in self-improvement that enhance LLMs without manual annotation, we present START, a Self-Taught AttRibuTion framework for iteratively improving the attribution capability of LLMs. First, to prevent the model from stagnating due to initially insufficient supervision signals, START leverages the model to self-construct synthetic training data for warming up. To further self-improve the model's attribution ability, START iteratively utilizes fine-grained preference supervision signals constructed from its sampled responses to encourage robust, comprehensive, and attributable generation. Experiments on three open-domain question-answering datasets, covering long-form QA and multi-step reasoning, demonstrate significant performance gains of 25.13% on average without relying on human annotations or more advanced models. Further analysis reveals that START excels in aggregating information across multiple sources.
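One round of such a self-improvement loop might be organized like the sketch below, with sampling and fine-tuning stubbed out; ranking samples by an "attributable" score is an assumption about how the fine-grained preference pairs are formed.

```python
import random

def sample_responses(prompt: str, n: int = 4) -> list:
    # Stub for sampling n responses from the current model; the
    # "attributable" field stands in for a fine-grained attribution score.
    random.seed(0)  # deterministic toy sampling
    return [{"text": f"answer v{i}", "attributable": random.random()} for i in range(n)]

def fine_tune(model: str, pairs: list) -> str:
    # Stand-in for one preference-optimization training step.
    return model + "+1round"

model = "warmed-up-model"          # warm-up on self-constructed synthetic data
for round_idx in range(2):         # iterative self-improvement
    pairs = []
    for prompt in ["q1", "q2"]:
        samples = sorted(sample_responses(prompt), key=lambda s: s["attributable"])
        pairs.append((samples[-1], samples[0]))   # most vs. least attributable
    model = fine_tune(model, pairs)
print(model)  # warmed-up-model+1round+1round
```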
On the Capacity of Citation Generation by Large Language Models
Qian, Haosheng; Fan, Yixing; Zhang, Ruqing; Guo, Jiafeng
Retrieval-augmented generation (RAG) is a promising method for alleviating the "hallucination" problem in large language models (LLMs), since it can incorporate external, traceable sources into response generation. The essence of RAG in combating hallucination lies in accurately attributing claims in responses to the corresponding retrieved documents. However, most existing work focuses on improving the quality of generated responses while largely overlooking the ability of LLMs to attribute sources accurately. In this study, we conduct a systematic analysis of the capabilities of LLMs to generate citations during response generation, and we introduce a novel method to enhance their citation-generation abilities. Specifically, we evaluate both the correctness and the citation quality of seven widely used LLMs on two benchmark datasets. We also introduce new citation evaluation metrics that eliminate the over-penalization of unnecessary and excessive citations in existing metrics. Furthermore, we propose a Generate-then-Refine method that adds missing relevant citations and removes irrelevant ones without altering the response text. Results on the WebGLM-QA, ASQA, and ELI5 datasets show that our method substantially improves the quality of citations in LLM-generated responses.
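The Generate-then-Refine idea, editing only the citations while leaving the response text untouched, can be sketched as follows; the lexical entails stub stands in for a real NLI or LLM judge, and all names are assumptions.

```python
# Toy retrieved-document store.
DOCS = {"d1": "the eiffel tower is in paris", "d2": "mount fuji is in japan"}

def entails(doc_id: str, claim: str) -> bool:
    # Stub: word overlap; a real refiner would use an NLI model.
    claim_words = set(claim.lower().strip(".").split())
    doc_words = set(DOCS[doc_id].split())
    return len(claim_words & doc_words) >= 3

def refine(claim: str, cites: list) -> list:
    kept = [c for c in cites if entails(c, claim)]            # remove irrelevant
    missing = [d for d in DOCS if d not in kept and entails(d, claim)]
    return kept + missing                                     # complete relevant

print(refine("The Eiffel Tower is in Paris.", ["d2"]))  # -> ['d1']
```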